{
 "cells": [
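  {
   "cell_type": "markdown",
   "id": "8644579b",
   "metadata": {},
   "source": [
    "## Description:\n",
    "This notebook performs a basic analysis of the data and some simple preprocessing. It covers:\n",
    "1. A basic data analysis, focused on users' click sequences\n",
    "2. Preprocessing: resolving and encoding gender and age in the user data, and filling missing values and encoding the article profile data\n",
    "3. Joining the user and article profiles onto the log data, then saving the result"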
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5ed72ed9",
   "metadata": {},
   "source": [
    "## Import packages"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "ba3b96d0",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import time\n",
    "import random\n",
    "from datetime import datetime\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ba45105c",
   "metadata": {},
   "source": [
    "## Load the datasets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "12a69b53",
   "metadata": {},
   "outputs": [],
   "source": [
    "base_path = 'all_data'\n",
    "doc_info_path = os.path.join(base_path, 'doc_info.txt')\n",
    "train_data_path = os.path.join(base_path, 'sample_2w_data.csv')\n",
    "user_info_path = os.path.join(base_path, 'user_info.txt')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "f799b716",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Read the data (the two .txt files are tab-separated and have no header row)\n",
    "doc_info = pd.read_csv(doc_info_path, delimiter='\\t', names=['article_id', 'title','ctime', 'img_num', 'cat_1', 'cat_2', 'key_words'])\n",
    "train_data = pd.read_csv(train_data_path)\n",
    "user_info = pd.read_csv(user_info_path, delimiter='\\t', names=['user_id', 'device', 'os', 'province', 'city', 'age', 'gender'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "9255152f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(592749, 7) (3939989, 8) (1538384, 7)\n"
     ]
    }
   ],
   "source": [
    "print(doc_info.shape, train_data.shape, user_info.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "38aee5e0",
   "metadata": {},
   "source": [
    "## First look at the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "d389238c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>article_id</th>\n",
       "      <th>title</th>\n",
       "      <th>ctime</th>\n",
       "      <th>img_num</th>\n",
       "      <th>cat_1</th>\n",
       "      <th>cat_2</th>\n",
       "      <th>key_words</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>349635709</td>\n",
       "      <td>拿到c1驾照后,实习期扣分了会怎样?扣12分驾照会吊销么?</td>\n",
       "      <td>1572519971000</td>\n",
       "      <td>9</td>\n",
       "      <td>汽车</td>\n",
       "      <td>汽车/用车</td>\n",
       "      <td>上班族:8.469502,买车:8.137443,二手车:9.022247,副页:11.21...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>361653323</td>\n",
       "      <td>疫情谣言粉碎机丨接种新冠疫苗后用麻药或致死?盘点最新疫情谣言,别被忽悠了</td>\n",
       "      <td>1624522285000</td>\n",
       "      <td>1</td>\n",
       "      <td>健康</td>\n",
       "      <td>健康/疾病防护治疗及西医用药</td>\n",
       "      <td>医生:14.760494,吸烟:16.474872,板蓝根:15.597788,板蓝根^^熏...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>426732705</td>\n",
       "      <td>实拍本田飞度:空间真大,8万出头工薪族可选,但内饰能忍?</td>\n",
       "      <td>1610808303000</td>\n",
       "      <td>9</td>\n",
       "      <td>汽车</td>\n",
       "      <td>汽车/买车</td>\n",
       "      <td>155n:8.979802,polo:7.951116,中控台:5.954278,中网:7....</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>430221183</td>\n",
       "      <td>搭载135kw电机比亚迪秦plus纯电动版外观更精致</td>\n",
       "      <td>1612581556000</td>\n",
       "      <td>2</td>\n",
       "      <td>汽车</td>\n",
       "      <td>汽车/买车</td>\n",
       "      <td>etc:12.055207,代表:8.878175,内饰:5.342025,刀片:9.453...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>441756326</td>\n",
       "      <td>【提车作业】不顾他人眼光帕萨特phev俘获30老男人浪子心</td>\n",
       "      <td>1618825835000</td>\n",
       "      <td>23</td>\n",
       "      <td>汽车</td>\n",
       "      <td>汽车/买车</td>\n",
       "      <td>丰田凯美瑞:12.772149,充电器:8.394001,品牌:8.436843,城市:7....</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   article_id                                 title          ctime img_num  \\\n",
       "0   349635709         拿到c1驾照后,实习期扣分了会怎样?扣12分驾照会吊销么?  1572519971000       9   \n",
       "1   361653323  疫情谣言粉碎机丨接种新冠疫苗后用麻药或致死?盘点最新疫情谣言,别被忽悠了  1624522285000       1   \n",
       "2   426732705          实拍本田飞度:空间真大,8万出头工薪族可选,但内饰能忍?  1610808303000       9   \n",
       "3   430221183            搭载135kw电机比亚迪秦plus纯电动版外观更精致  1612581556000       2   \n",
       "4   441756326         【提车作业】不顾他人眼光帕萨特phev俘获30老男人浪子心  1618825835000      23   \n",
       "\n",
       "  cat_1           cat_2                                          key_words  \n",
       "0    汽车           汽车/用车  上班族:8.469502,买车:8.137443,二手车:9.022247,副页:11.21...  \n",
       "1    健康  健康/疾病防护治疗及西医用药  医生:14.760494,吸烟:16.474872,板蓝根:15.597788,板蓝根^^熏...  \n",
       "2    汽车           汽车/买车  155n:8.979802,polo:7.951116,中控台:5.954278,中网:7....  \n",
       "3    汽车           汽车/买车  etc:12.055207,代表:8.878175,内饰:5.342025,刀片:9.453...  \n",
       "4    汽车           汽车/买车  丰田凯美瑞:12.772149,充电器:8.394001,品牌:8.436843,城市:7....  "
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "doc_info.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "d19c4897",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>user_id</th>\n",
       "      <th>device</th>\n",
       "      <th>os</th>\n",
       "      <th>province</th>\n",
       "      <th>city</th>\n",
       "      <th>age</th>\n",
       "      <th>gender</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1000372820</td>\n",
       "      <td>TAS-AN00</td>\n",
       "      <td>Android</td>\n",
       "      <td>广东</td>\n",
       "      <td>广州</td>\n",
       "      <td>A_0_24:0.404616,A_25_29:0.059027,A_30_39:0.516...</td>\n",
       "      <td>female:0.051339,male:0.948661</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1000652892</td>\n",
       "      <td>PACM00</td>\n",
       "      <td>Android</td>\n",
       "      <td>河北</td>\n",
       "      <td>唐山</td>\n",
       "      <td>A_0_24:0.615458,A_25_29:0.086233,A_30_39:0.141...</td>\n",
       "      <td>female:0.280295,male:0.719705</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1000908852</td>\n",
       "      <td>MI6X</td>\n",
       "      <td>Android</td>\n",
       "      <td>上海</td>\n",
       "      <td>上海</td>\n",
       "      <td>A_0_24:0.123255,A_25_29:0.208225,A_30_39:0.298...</td>\n",
       "      <td>female:0.000000,male:1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1001168798</td>\n",
       "      <td>iPhone11</td>\n",
       "      <td>IOS</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>A_0_24:0.436296,A_25_29:0.489370,A_30_39:0.061...</td>\n",
       "      <td>female:0.870710,male:0.129290</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1001305614</td>\n",
       "      <td>M2103K19C</td>\n",
       "      <td>Android</td>\n",
       "      <td>江苏</td>\n",
       "      <td>苏州</td>\n",
       "      <td>A_0_24:0.006632,A_25_29:0.043408,A_30_39:0.350...</td>\n",
       "      <td>female:0.000000,male:1.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      user_id     device       os province city  \\\n",
       "0  1000372820   TAS-AN00  Android       广东   广州   \n",
       "1  1000652892     PACM00  Android       河北   唐山   \n",
       "2  1000908852       MI6X  Android       上海   上海   \n",
       "3  1001168798   iPhone11      IOS      NaN  NaN   \n",
       "4  1001305614  M2103K19C  Android       江苏   苏州   \n",
       "\n",
       "                                                 age  \\\n",
       "0  A_0_24:0.404616,A_25_29:0.059027,A_30_39:0.516...   \n",
       "1  A_0_24:0.615458,A_25_29:0.086233,A_30_39:0.141...   \n",
       "2  A_0_24:0.123255,A_25_29:0.208225,A_30_39:0.298...   \n",
       "3  A_0_24:0.436296,A_25_29:0.489370,A_30_39:0.061...   \n",
       "4  A_0_24:0.006632,A_25_29:0.043408,A_30_39:0.350...   \n",
       "\n",
       "                          gender  \n",
       "0  female:0.051339,male:0.948661  \n",
       "1  female:0.280295,male:0.719705  \n",
       "2  female:0.000000,male:1.000000  \n",
       "3  female:0.870710,male:0.129290  \n",
       "4  female:0.000000,male:1.000000  "
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "user_info.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "18c83118",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>user_id</th>\n",
       "      <th>article_id</th>\n",
       "      <th>expo_time</th>\n",
       "      <th>net_status</th>\n",
       "      <th>flush_nums</th>\n",
       "      <th>exop_position</th>\n",
       "      <th>click</th>\n",
       "      <th>duration</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1000541010</td>\n",
       "      <td>464467760</td>\n",
       "      <td>2021-06-30 09:57:14</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>13</td>\n",
       "      <td>1</td>\n",
       "      <td>28</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1000541010</td>\n",
       "      <td>463850913</td>\n",
       "      <td>2021-06-30 09:57:14</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>15</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1000541010</td>\n",
       "      <td>464022440</td>\n",
       "      <td>2021-06-30 09:57:14</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>17</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1000541010</td>\n",
       "      <td>464586545</td>\n",
       "      <td>2021-06-30 09:58:31</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>20</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1000541010</td>\n",
       "      <td>465352885</td>\n",
       "      <td>2021-07-03 18:13:03</td>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>18</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      user_id  article_id            expo_time  net_status  flush_nums  \\\n",
       "0  1000541010   464467760  2021-06-30 09:57:14           2           0   \n",
       "1  1000541010   463850913  2021-06-30 09:57:14           2           0   \n",
       "2  1000541010   464022440  2021-06-30 09:57:14           2           0   \n",
       "3  1000541010   464586545  2021-06-30 09:58:31           2           1   \n",
       "4  1000541010   465352885  2021-07-03 18:13:03           5           0   \n",
       "\n",
       "   exop_position  click  duration  \n",
       "0             13      1        28  \n",
       "1             15      0         0  \n",
       "2             17      0         0  \n",
       "3             20      0         0  \n",
       "4             18      0         0  "
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train_data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "007e555c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "20000"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Number of distinct users and articles in the interaction data\n",
    "train_data['user_id'].nunique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "1e6dc4b4",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "114620"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train_data['article_id'].nunique()  # about 115,000 distinct articles"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "bfc175fd",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 曝光时间探索  毫秒级时间戳 \n",
    "# 转换成python的时间格式\n",
    "#train_data['expo_time'] = train_data['expo_time'].apply(lambda x: datetime.fromtimestamp(float(x)/1000) \\\n",
    "#                                                        .strftime('%Y-%m-%d %H:%M:%S'))\n",
    "train_data['expo_time'] = pd.to_datetime(train_data['expo_time'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "051781cb",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2021-06-30 00:00:01    2021-07-06 23:59:59\n"
     ]
    }
   ],
   "source": [
    "print(train_data['expo_time'].min(), \"  \", train_data['expo_time'].max())   # June 30 to July 6: 7 days of behavior logs in this sample\n",
    "# Later, the data through July 5 is used as the training set and the July 6 data as the test set"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "4dcc8c82",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    3353933\n",
       "1     586056\n",
       "Name: click, dtype: int64"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train_data['click'].value_counts()    "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a193037b",
   "metadata": {},
   "source": [
    "## Click sequence analysis"
   ]
  },
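  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "8a80cf13",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Keep only the clicked rows; .copy() gives an independent frame so columns can be added to it safely\n",
    "click_df = train_data[train_data['click']==1].copy()"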
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "17d1924e",
   "metadata": {},
   "outputs": [],
   "source": [
    "click_df['date'] = click_df['expo_time'].dt.date"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "a2ac5b7c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(234045, 9)"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "click_df.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "588cb101",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "20000"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "click_df['user_id'].nunique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "33bc3d99",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "48383"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "click_df['article_id'].nunique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "8a7ca97c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "date\n",
       "2021-06-30     57053\n",
       "2021-07-01     55657\n",
       "2021-07-02     53724\n",
       "2021-07-03     50291\n",
       "2021-07-04     78487\n",
       "2021-07-05    146634\n",
       "2021-07-06    144210\n",
       "Name: click, dtype: int64"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Number of clicks on each day\n",
    "click_df.groupby('date')['click'].count()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "5a25d403",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "count    20000.000000\n",
       "mean        29.302800\n",
       "std         19.568923\n",
       "min         10.000000\n",
       "25%         14.000000\n",
       "50%         23.000000\n",
       "75%         38.000000\n",
       "max        100.000000\n",
       "Name: click, dtype: float64"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Distribution of the number of clicks per user\n",
    "click_df.groupby('user_id')['click'].count().describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "57344dae",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "user_id\n",
       "17340         46\n",
       "490452        58\n",
       "4419744       86\n",
       "4564792       38\n",
       "5419396       19\n",
       "              ..\n",
       "2447162784    30\n",
       "2447169712    18\n",
       "2447202996    10\n",
       "2447212098    11\n",
       "2447231894    17\n",
       "Name: click, Length: 20000, dtype: int64"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "click_df.groupby('user_id')['click'].count()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "28d7082c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "click\n",
       "10       1217\n",
       "11       1082\n",
       "12        988\n",
       "13        914\n",
       "14        816\n",
       "         ... \n",
       "99         25\n",
       "95         23\n",
       "97         23\n",
       "100        20\n",
       "93         16\n",
       "Length: 91, dtype: int64"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.DataFrame(click_df.groupby('user_id')['click'].count()).value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "id": "b9aaa284",
   "metadata": {},
   "outputs": [],
   "source": [
    "click_nums = pd.DataFrame(click_df.groupby('user_id')['click'].count())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "id": "82557d86",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(31164, 1)"
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "click_nums.shape"
   ]
  },
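  {
   "cell_type": "markdown",
   "id": "794d87c5",
   "metadata": {},
   "source": [
    "One user clicked as many as 1111 times, so some care is needed when building sliding windows; cap the sequence length at 40. (The 20,000-user sample summarized above tops out at 100 clicks per user, so the 1111 figure presumably comes from a larger data cut.)\n",
    "\n",
    "The windows are split by sequence length:\n",
    "* Length below 10: slide with a step of 1\n",
    "* Length 10-20: slide with a step of 2\n",
    "* Length 20-30: slide with a step of 3\n",
    "* Length 30-40: slide with a step of 4\n"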
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "abb84145",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<AxesSubplot:>"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAX0AAAD4CAYAAAAAczaOAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjQuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/MnkTPAAAACXBIWXMAAAsTAAALEwEAmpwYAAAiVUlEQVR4nO3deXxV9Z3/8dfn3ps9kIUEBJKwCEpRQTQiaqtWa6vVcelicWpdO9THr07tPnaWn52lHTvt1Npp1TJq1bY/17YjddxtXVo3QimbLAYQEtYASQhkTz6/P+4BwxrIdpJ73s+Hedx7vufcez+5nrzv4Xu/53vM3RERkWiIhV2AiIgMHIW+iEiEKPRFRCJEoS8iEiEKfRGRCEmEXcDhFBUV+fjx48MuQ0RkSFmwYME2dy8+2LpBHfrjx4+noqIi7DJERIYUM1t3qHXq3hERiRCFvohIhCj0RUQiRKEvIhIhCn0RkQhR6IuIRIhCX0QkQlIy9Osb27jzxXdZXF0XdikiIoNKt6FvZveb2VYzW9ql7ftmtsLMFpvZb80sv8u6b5lZpZmtNLOPdWm/MGirNLNb+/w36SIWgzteXMWfKrf358uIiAw5R3Kk/wBw4X5tLwAnuvs0YBXwLQAzmwrMBk4IHnOXmcXNLA78FLgImApcFWzbL4ZlplE8LIM1Nbv66yVERIakbkPf3V8FduzX9ry7tweLbwIlwf3LgEfcvcXd1wKVwMzgp9Ld17h7K/BIsG2/mViUw5ptu/vzJUREhpy+6NO/AXgmuD8WqOqyrjpoO1T7AcxsjplVmFlFTU1Nj4uaWJyrI30Rkf30KvTN7B+AduBXfVMOuPtcdy939/Li4oNOEndEJhblUNvYRu3u1r4qTURkyOtx6JvZdcAlwGf9/aurbwBKu2xWErQdqr3fTCzOAWDNNh3ti4js0aPQN7MLgW8Cl7p7Y5dV84DZZpZhZhOAycDbwHxgsplNMLN0kl/2zutd6Yc3sTgXgNU16tcXEdmj2/n0zexh4FygyMyqgdtIjtbJAF4wM4A33f0md19mZo8B75Ds9vmiu3cEz3Mz8BwQB+5392X98PvsVVqQRVrcWKPQFxHZq9vQd/erDtJ832G2/w7wnYO0Pw08fVTV9UIiHqOsMFtf5oqIdJGSZ+TuMbE4V8M2RUS6SPHQz2Hd9t20d3SGXYqIyKCQ0qF/bFEubR1OdW1T2KWIiAwKKR36GrYpIrKvFA/95LBNjeAREUlK6dAvzEknPztNY/VFRAIpHfoQTLymYZsiIkAUQl/DNkVE9opA6OdQ09BCQ3Nb2KWIiIQu9UO/SF/miojskfKhf6yGbYqI7JXyoV82IpuY6UhfRAQiEPoZiTilhdkKfRERIhD6AJOKc1m1pSHsMkREQheJ0J9Wkk9lzS6N4BGRyItE6M8oy8cdFlfXh12KiEioIhH600vzAVi4vjbcQkREQhaJ0M/LSmPSyFwWrq8LuxQRkVBFIvQBZpTms7CqDncPuxQRkdBEJvRPLstnx+5W1u9oDLsUEZHQRCb0Z5QWAKiLR0QiLTKhf9yoXLLT4/oyV0QiLTKhn4jHmFaSx8KqurBLEREJTWRCH2BGWQHvbNxJc1tH2KWIiISi29A3s/vNbKuZLe3SVmhmL5jZu8FtQdBuZvZjM6s0s8VmdkqXx1wbbP+umV3bP7/O4c0ozae901m6QSdpiUg0HcmR/gPAhfu13Qq85O6TgZeCZYCLgMnBzxzgbkh+SAC3AacDM4Hb9nxQDKSTy/IB+Iu6eEQkoroNfXd/FdixX/NlwIPB/QeBy7u0P+RJbwL5ZjYa+BjwgrvvcPda4AUO/CDpdyOHZVJSkKURPCISWT3t0x/l7puC+5uBUcH9sUBVl+2qg7ZDtR/AzOaYWYWZVdTU1PSwvEObUVagETwiElm9/iLXk6e49tlpru4+193L3b28uLi4r552rxml+Wysb2ZzfXOfP7eIyGDX
09DfEnTbENxuDdo3AKVdtisJ2g7VPuBmTigE4E+V28J4eRGRUPU09OcBe0bgXAs82aX9mmAUzyygPugGeg74qJkVBF/gfjRoG3BTRw9nRE46r77b911HIiKDXaK7DczsYeBcoMjMqkmOwrkdeMzMbgTWAVcGmz8NfByoBBqB6wHcfYeZ/SswP9juX9x9/y+HB0QsZpx9XDGvrKqhs9OJxSyMMkREQtFt6Lv7VYdYdf5BtnXgi4d4nvuB+4+qun5y9nFF/HbhBpZt3MlJJXlhlyMiMmAidUbuHh+anPyCWF08IhI1kQz9otwMThgznFdWKfRFJFoiGfoAZx9XzJ/X1epi6SISKdEN/cnFtHc6r6/eHnYpIiIDJrKhf+q4AnLS47yqLh4RiZDIhn56IsYZx47g1XdrdN1cEYmMyIY+wDnHFVO1o4n3tuu6uSISDZEO/bOPC4ZuqotHRCIi0qE/bkQOY/OzePu9UE4OFhEZcJEOfYCTS/NZpIuqiEhERD70p5fmUV3bxPZdLWGXIiLS7yIf+tNK8gFYXK3r5opI6ot86J80No+Y6bq5IhINkQ/9nIwEk0cOY3F1XdiliIj0u8iHPsC0kjwWVdfrJC0RSXkKfWB6aT47drdSXdsUdikiIv1KoU9y2CbAInXxiEiKU+gDxx8zjPRETOP1RSTlKfSBtHiME8YMZ1GVhm2KSGpT6Aeml+SzZEM97R2dYZciItJvFPqB6aV5NLV1UFmzK+xSRET6jUI/MD04M1f9+iKSyhT6gfEjchiWmWCRpmMQkRSm0A/EYsb0knwWvFdLR6dO0hKR1NSr0Dezr5jZMjNbamYPm1mmmU0ws7fMrNLMHjWz9GDbjGC5Mlg/vk9+gz50/gdGsnJLA5/52Ru8t2132OWIiPS5Hoe+mY0FvgSUu/uJQByYDXwPuMPdJwG1wI3BQ24EaoP2O4LtBpXrzhzPHZ+ZzsotDVx052v84o33NDWDiKSU3nbvJIAsM0sA2cAm4DzgiWD9g8Dlwf3LgmWC9eebmfXy9fuUmXHFjBKe/8rZlI8v4J+eXMZLy7eGXZaISJ/pcei7+wbgB8B6kmFfDywA6ty9PdisGhgb3B8LVAWPbQ+2H7H/85rZHDOrMLOKmppwrl07Oi+L+649jfREjLfWbg+lBhGR/tCb7p0CkkfvE4AxQA5wYW8Lcve57l7u7uXFxcW9fboeS0/oLF0RST296d75CLDW3WvcvQ34DXAWkB909wCUABuC+xuAUoBgfR4wqA+jdZauiKSa3oT+emCWmWUHffPnA+8AfwA+FWxzLfBkcH9esEyw/vc+yL8lPbk0n6a2Dt7dqrN0RSQ19KZP/y2SX8j+GVgSPNdc4O+Ar5pZJck++/uCh9wHjAjavwrc2ou6B8TeKZd1lq6IpIhE95scmrvfBty2X/MaYOZBtm0GPt2b1xto40Zkk5eVxl+q6pg9syzsckREek1n5B6GmTG9NF8XTReRlKHQ78bJJXms2tJAY2t79xuLiAxyCv1unFyWT6fD0g07wy5FRKTXFPrdmKYpl0UkhSj0u1GUm0FJQZb69UUkJSj0j4C+zBWRVKHQPwIzSvPZUNdETUNL2KWIiPSKQv8ITA9O0lpcXRdqHSIivaXQPwInjBlOPGbq4hGRIU+hfwSy0xNMHT2cJ/+ykfqmtrDLERHpMYX+Efq/fzWVjXVN3PLIQl1DV0SGLIX+ETptfCHfvvQEXl5Zw38+vzLsckREeqRXE65FzWdPL2PZxnruenk1U8cM55JpY8IuSUTkqOhI/yiYGd++9AROHVfANx5fzPJNmppBRIYWhf5RykjEufuzpzA8K8GcX1RQu7s17JJERI6YQr8HRg7P5J6rT2VLfQs3P/xnXU5RRIYMhX4PzSgr4N+uOJE/VW7n9mdWhF2OiMgRUej3wpXlpVx28hju/eNaXllVE3Y5IiLd0uidHmrv6OSul1fzv4s3MXJYBscMzwy7JBGRbin0e2DbrhZu
fGA+i6rruXT6GP7lshPIz04PuywRkW4p9Htg1ZYGFlXXc9r4Au6cfTJmFnZJIiJHRH36PXDmsUV8+SOTmf9eLf+uL3FFZAjRkX4P3XL+ZGp3tzL31TUU5qRz0znHhl2SiEi3FPo9ZGbc9lcnUNvYxu3PrKAgO43PnFYWdlkiIofVq+4dM8s3syfMbIWZLTezM8ys0MxeMLN3g9uCYFszsx+bWaWZLTazU/rmVwhPLGb84NPTOee4Yr71myU8u3Rz2CWJiBxWb/v07wSedfcpwHRgOXAr8JK7TwZeCpYBLgImBz9zgLt7+dqDQnoixt1Xn8LJpfl86eGFvL56W9gliYgcUo9D38zygLOB+wDcvdXd64DLgAeDzR4ELg/uXwY85ElvAvlmNrqnrz+YZKcnuP+60xhflM2chxawpLo+7JJERA6qN0f6E4Aa4OdmttDM7jWzHGCUu28KttkMjArujwWqujy+Omjbh5nNMbMKM6uoqRk6Z7nmZ6fz0A2nk5eVxnU/f5s1NbvCLklE5AC9Cf0EcApwt7vPAHbzflcOAO7uwFFdZsrd57p7ubuXFxcX96K8gXdMXia//PzpAHzuvrfZVN8UckUiIvvqTehXA9Xu/law/ATJD4Ete7ptgtutwfoNQGmXx5cEbSllQlEOD94wk/qmNq65721NvSwig0qPQ9/dNwNVZnZ80HQ+8A4wD7g2aLsWeDK4Pw+4JhjFMwuo79INlFJOHJvHf19TzrodjVz/wHzWbtsddkkiIgBYsgemhw82Oxm4F0gH1gDXk/wgeQwoA9YBV7r7DkvOVfAT4EKgEbje3SsO9/zl5eVeUXHYTQa155dt5uaHF9LW0cn5U0Zyw1kTOOPYEZq2QUT6lZktcPfyg67rTej3t6Ee+lsbmnl+2Rb+49kV7GxuB+DWi6bo7F0R6VeHC32dkdvHlm2s5+6XV7NwfR0b6pJf5KbFjeml+cwozeejU0d18wwiIv1Hod/HXl5Zw1OLNzGxOId/vPgDzCjL54QxeWSmxcMuTUREs2z2tc9/aAKnjitgY10TMycUcuq4QgW+iAwaCv0+lpGI87PPncqInAz+5qEKNtc3h12SiMheCv1+UJSbwb3XltPQ3M6cX1TQ1NoRdkkiIoBCv998YPRw7pw9gyUb6vnGE4sYzKOkRCQ6FPr96IKpo/i7C6fw1OJN/PilyrDLERHR6J3+9oWzJ7JqSwN3vLiKSSNzuXhaSkwsKiJDlI70+5mZ8e+fOIlTxxXwtcf/ommXRSRUCv0B0HVEz+cfms+WnRrRIyLhUOgPkH1G9DxUQXObRvSIyMBT6A+gPSN6Fm+o5xtPLNaIHhEZcAr9AXbB1FF882NT+N2ijfzX7zWiR0QGlkI/BDedM5FPnDKWO15cxeuVupC6iAwchX4IzIzvXH4SE4ty+Opji6hr1NW1RGRgKPRDkpUe587ZM9i+u4W//+0S9e+LyIBQ6IfoxLF5fPWC43l6yWaeWFAddjkiEgEK/ZDNOXsisyYW8u15y1hUVRd2OSKS4hT6IYvHjB9eeTK5mQmuuOtP/NtT77C7pT3sskQkRSn0B4Ex+Vk8/5VzmD2zjHv/uJaP3vEqb67ZHnZZIpKCFPqDRF5WGt+94iQev+kMEnHjlkcW0tbRGXZZIpJiFPqDzGnjC/nHi6eyZWcLLy3fEnY5IpJiFPqD0HlTRjImL5Nfvrk+7FJEJMUo9AeheMy4amYZf6zcxpqaXWGXIyIppNehb2ZxM1toZk8FyxPM7C0zqzSzR80sPWjPCJYrg/Xje/vaqewzM0tJxIxfvaWjfRHpO31xpH8LsLzL8veAO9x9ElAL3Bi03wjUBu13BNvJIYwclsnHTjyGJxZUaxpmEekzvQp9MysBLgbuDZYNOA94ItjkQeDy4P5lwTLB+vOD7eUQrj59HPVNbfxu0cawSxGRFNHbI/0fAd8E9owtHAHUufues4uqgbHB/bFAFUCwvj7Yfh9mNsfMKsysoqam
ppflDW2zJhYyaWQuv3xznebmEZE+0ePQN7NLgK3uvqAP68Hd57p7ubuXFxcX9+VTDzlmxjVnjGNRdT3n/ecr3P3yarY26FKLItJziV489izgUjP7OJAJDAfuBPLNLBEczZcAG4LtNwClQLWZJYA8QKedduPq08eRk57g0flVfO/ZFfzg+ZWcN2Uks08r5ZzjiknENQBLRI6c9UW3gZmdC3zd3S8xs8eBX7v7I2Z2D7DY3e8ysy8CJ7n7TWY2G/iEu195uOctLy/3ioqKXteXKlbX7OKxiip+vaCabbtaGTU8g6tmlnHDBycwPDMt7PJEZJAwswXuXn7Qdf0Q+hOBR4BCYCFwtbu3mFkm8AtgBrADmO3uaw73vAr9g2vr6OSl5Vt5dP56/rCyhoLsNL744UlcPWscmWnxsMsTkZD1e+j3F4V+95ZU1/Mfz63gtXe3MSYvkwumjuL0iSM4bXwhxcMywi5PREJwuNDvTZ++hKByawNLN+xkZ3Mb9Y1t7GxuY3ReJsMyE2ysb+bBN9bx4BvrAPjWRVP4wjnHhlyxiAwmCv0h5m8eWsDabbsPaM/NSHDquAJyMhJU72ikoaWdM449YESsiEScQn+IefhvZvH66m0srq5nUXUdyzbupLW9k10t7ayu2cVJY/P4+EmjmVaSx6jhmWGXKyKDjPr0h7i2jk5Wbm5gcXU9i6vrWFRdz6otDXR0Jv+/jhqewbSSfM46dgTXnDGeWEwnQYukOvXpp7C0eIwTx+Zx4tg8/vr0MgCaWjt4Z1M9i6re/yB44Z0trNyyi+9ecSKa/UIkuhT6KSgrPc6p4wo5dVwhAO7O959byV0vryYtbvzzpSco+EUiSqEfAWbGNz52PO2dztxX15CIxfinSz6g4BeJIIV+RJgZ37poCq3tndz/p7WMG5HNtWeOD7ssERlgmrglQsyM2/5qKh+aXMQPnl/Jtl0tYZckIgNMoR8xyeA/gabWDr7/7MqwyxGRAabQj6BJI3O54YMTeGxBFYuq6sIuR0QGkEI/ov72vEmMyMng279bRmfn4D1XQ0T6lkI/ooZlpnHrRVNYuL6Oua+tYXXNLhqa23SFLpEUpzNyI6yz07nyZ29Qsa52b1tBdhr/cPFUPnnKWA3pFBmidEauHFQsZvzixtNZuL6WrQ0tbG1o5oV3tvD1xxfxx3dr+LcrTiI3Q7uISCrRX3TEZaXHOXNS0d7lGz84kZ/+oZIfvbiKhVV1/NdVM5hWkh9egSLSp9SnL/uIx4wvnT+ZR79wBm3tnXzy7tf571fX6MtekRSh0JeDOm18IU/f8iHOmzKS7zy9nOsfmE9Ng07mEhnqFPpySPnZ6dxz9an86+Un8saa7Vx052u89m5N2GWJSC8o9OWwzIzPzRrHvJvPIj87jWvuf5vbn1lBW0dn2KWJSA8o9OWITDlmOL+7+YN8pryUe15ZzafueYO6xtawyxKRo6TRO3JEane3Mnvum6zc0gDAoqo6Hquo4pSyAjLT4mQkYntvczIS5Giop8igpL9MOSLPLN28N/D3+O7TKw65fVlhNieOHc4JY5JX9TphzHCKcjP6u0wR6UaPQ9/MSoGHgFGAA3Pd/U4zKwQeBcYD7wFXunutJU/vvBP4ONAIXOfuf+5d+TJQ/vr0Mj48pZjmtk5a2jtoaeukua2DlvZOWtq73u+gdncr72zaybKNO3l6yea9z3HM8ExmlOVz7vHFfPj4kYzUhdtFBlxvjvTbga+5+5/NbBiwwMxeAK4DXnL3283sVuBW4O+Ai4DJwc/pwN3BrQwRo/Oyjnhbd6emoYXF1fW8sqqGP1VuY8223TyzdDPPLH3/g2DhP11AQU56f5QrIgfR49B3903ApuB+g5ktB8YClwHnBps9CLxMMvQvAx7y5GQ/b5pZvpmNDp5HhrCahhb+49kVbG1oYduu5M/2Xa20H+SErsy0GM1tyZE/RbnpxDS/j8iA6pM+fTMbD8wA3gJGdQnyzSS7fyD5gVDV
5WHVQZtCf4j7l6fe4XeLNh52m+JhGZQUZFGUm8GwjATDMhPkZiZ4eP56coPl5E8auRkJxuRnkZeVNkC/gUh09Dr0zSwX+DXwZXff2XVmRnd3Mzuq8/fNbA4wB6CsrKy35ckA+O4VJ/KhyUUMy0jQ3N5BQ3P73p9dLW3J22C5uraJhuY2drUklzsOMb1DUW4G8//hfM30KdLHehX6ZpZGMvB/5e6/CZq37Om2MbPRwNagfQNQ2uXhJUHbPtx9LjAXklMr96Y+GRjDMtO4sry0+w334+40tXWwq7mdnc3t7Gpp5ye/f5cXl2/ls6eXKfBF+kGPT84KRuPcByx39x92WTUPuDa4fy3wZJf2ayxpFlCv/vxoMzOy0xOMHJ7JpJG5VO1o5MXlW7n85DF8+SOTwy5PJCX15kj/LOBzwBIz+0vQ9vfA7cBjZnYjsA64Mlj3NMnhmpUkh2xe34vXlhSzYF0tX3t8EROLcvg/H55EdW0TibgRjxmJWIxE3EjEksvp8Zj+FSDSQ7pylgwKn7r79X2u4HU4V80s5d8/Ma2fKxIZunTlLBn0bv/kNFZtaaC90+no7KStw+nodFrbO7lt3rJ9tn1vWyN3vVxJbkaC7PQEuRlxstMTwfQPcXKC+3lZacRj+heBSFcKfRkUJo3MZdLI3APaOzudV1fVsHb7bhpbOtjd2s5ba7fzxprt3T7naeMLePymM/ujXJEhS6Evg1osZtx33Wn7tLk7Le2d7Gppp7GlgyUb6nlk/nr+WLkNd4gZnHlsETd8cHw4RYsMYgp9GXLMjMy0OLWNrZz9/T/ss+7MY0fw+Q9NYFpJPiM0vYPIART6MmQd7Lyu11dv5/XVya6fzLQYY/KzGJufRUlB8nZsQRZj87MZW5DFMcMz1ecvkaPQlyFrbH4W791+MZDs8qlvamNjXTPVtY1sqGtiQ21T8rauidfe3XbI55lYlMPvv37uAFUtEi4N2ZSUV7u7lRn/+sJht8lOj5OIGTub2w+73ZJvf5RhmZoTSAY3DdmUSMvLSuMfL/4Ab6/dQUlBNsXDMuh0p62jk/YOp60zedve0cmDb6w77HNtrm8mEYuRlR4foOpF+paO9EX2s7m+mScWVPFoRRVVO5oOuk1mWozC7HQKctIpzEmnILvrbRoFwf2C7HSOHZlDRkIfEjJwDnekr9AXOQR3p7axjdrGVmp3t7Jjdyu1ja3s2N0W3CbbaxtbqW1sY8fuVuqb2g54nk+dWsIPPj09hN9AokrdOyI9YGYUBkfyFB96uxWbd3Lhj1476LqLp43m5g9P6qcKRY6eQl+kB367sJqvPLqo2+3+d/Em/nfxvpPJ/vz608hIxIhZcgK55KRy708st2f5YJPNJWIx0hMxDTWVHlPoi/TAz15Z0+PHXv/z+b1+/T1DVUWOlvr0RXrI3YMJ4oLbDqe9s5OOTqdtv+U927V1vL9872treHH51u5f6Cg8eMNMzjnuMH1REgnq0xfpB2ZGWtxI6+HAnFkTR+y9f88rq7n9mRW9rulv/9+f+f6npwczjcbJyUiQnR5nWEYaedk6v0B0pC8yKLg789+rpaPTiRl0uLOxrpn123ezfkcjtY1tNLV10NzWQVNrxz73G9s6OJI/49s/cRKzZ+q601GgI32RQc7MmDmh8Kge4+585IevsLpm9xFtX7Gulg11Tft8gZy8DzF7/8viw12VbFhmgvzsdAqy0zj+mGE6/2AIUuiLDFFmydA+mOGZCcyMzk6nI/ju4X8WbqDD/Yj+VXCkvnD2RGIx45SyAi6YOqrvnlj6jbp3RCLm2aWbuemXC/r8eW8651jKCrPfH14a33c4any/oad718X3Xe762LSDPC5m6BrJ3VD3jojsNXnUgVco6wv3vLK6X563O8MyE3xu1jgccAfHCf7Dg3/ZdDrMOXsix+RlhlLjYKIjfRE5anu6jfYfhtrR6dzwwHyWbdwZdokHmDmhkMe+cEbYZQwIzb0jIqF4vKKKbzyx
OOwyeuXVb3yY0sKsIdWlpO4dEQnFJ08pIS0eY9uuln3+NdDemZzKuqmtg8aWDna1ttPU2rH3cb9f0bcnrfXG/pfkPPu4Ygz2frcQM3hr7Q4aurkWQ15WGsMyE2SmxclMi5GRiJORiJGRiJGZFuf4Y4bxt+dN7vcpNnSkLyKDTtWORv5YuS0ZrBhYclipAbFYss2C0E0GcLBMsv++053Ovf35nmzrTLZt393K959b2aO6YgbTSvKT3xXA3tfoy+6sPb/H4zedyanjCnr4HIPoSN/MLgTuBOLAve5++0DXICKDW2lhNlf144lkX+zlzKdPL9nEfz6/kpb2TlrbO/uoqiQPvoT+n4Ubehz6hzOgoW9mceCnwAVANTDfzOa5+zsDWYeISG/kZaUx5ZjhpAfdM+mJGOnx4LbLcrL7Jr5P2/7bZKbFSI/HD2hPi/dPN89AH+nPBCrdfQ2AmT0CXAYo9EVkyDhrUhFnTSoKu4weiQ3w640FqrosVwdtIiIyAAY69LtlZnPMrMLMKmpqasIuR0QkpQx06G8ASrsslwRte7n7XHcvd/fy4mLNCy4i0pcGOvTnA5PNbIKZpQOzgXkDXIOISGQN6Be57t5uZjcDz5Ecsnm/uy8byBpERKJswMfpu/vTwNMD/boiIjIIv8gVEZH+o9AXEYmQQT33jpnVAOvCrqOXioBtYRcxiOj92Jfej/fpvdhXb96Pce5+0OGPgzr0U4GZVRxq4qMo0vuxL70f79N7sa/+ej/UvSMiEiEKfRGRCFHo97+5YRcwyOj92Jfej/fpvdhXv7wf6tMXEYkQHemLiESIQl9EJEIU+n3IzErN7A9m9o6ZLTOzW4L2QjN7wczeDW77/hpog5SZxc1soZk9FSxPMLO3zKzSzB4NJt6LBDPLN7MnzGyFmS03szMivm98Jfg7WWpmD5tZZpT2DzO738y2mtnSLm0H3R8s6cfB+7LYzE7p6esq9PtWO/A1d58KzAK+aGZTgVuBl9x9MvBSsBwVtwDLuyx/D7jD3ScBtcCNoVQVjjuBZ919CjCd5PsSyX3DzMYCXwLK3f1EkhMwziZa+8cDwIX7tR1qf7gImBz8zAHu7vGrurt++ukHeJLk9YBXAqODttHAyrBrG6DfvyTYcc8DngKM5BmGiWD9GcBzYdc5QO9FHrCWYPBEl/ao7ht7rqJXSHLix6eAj0Vt/wDGA0u72x+AnwFXHWy7o/3RkX4/MbPxwAzgLWCUu28KVm0GRoVV1wD7EfBNoDNYHgHUuXt7sByly2VOAGqAnwfdXfeaWQ4R3TfcfQPwA2A9sAmoBxYQ3f1jj0PtD312qVmFfj8ws1zg18CX3X1n13We/JhO+XGyZnYJsNXdF4RdyyCRAE4B7nb3GcBu9uvKicq+ARD0VV9G8sNwDJDDgV0dkdZf+4NCv4+ZWRrJwP+Vu/8maN5iZqOD9aOBrWHVN4DOAi41s/eAR0h28dwJ5JvZnus4HHC5zBRWDVS7+1vB8hMkPwSiuG8AfARY6+417t4G/IbkPhPV/WOPQ+0P3V5q9kgp9PuQmRlwH7Dc3X/YZdU84Nrg/rUk+/pTmrt/y91L3H08yS/ofu/unwX+AHwq2CwS7wWAu28Gqszs+KDpfOAdIrhvBNYDs8wsO/i72fN+RHL/6OJQ+8M84JpgFM8soL5LN9BR0Rm5fcjMPgi8Bizh/X7svyfZr/8YUEZyqugr3X1HKEWGwMzOBb7u7peY2USSR/6FwELgandvCbG8AWNmJwP3AunAGuB6kgdekdw3zOyfgc+QHPW2EPg8yX7qSOwfZvYwcC7JKZS3ALcB/8NB9ofgg/EnJLvAGoHr3b2iR6+r0BcRiQ5174iIRIhCX0QkQhT6IiIRotAXEYkQhb6ISIQo9EVEIkShLyISIf8fwU7ge42CdPgAAAAASUVORK5CYII=\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "click_df.groupby('user_id')['click'].apply(lambda x: x.count()).value_counts().plot()\n",
    "# 这个如果用划窗的话， 历史点击次数太多的，这种不好用  需要限制下点击次数，比如对于不活跃的用户，可以使用划窗数据"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1c10db4b",
   "metadata": {},
   "source": [
    "## 看点击数据中的时长\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "f6e741ca",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3       4638\n",
       "4       4623\n",
       "6       4512\n",
       "5       4478\n",
       "7       4288\n",
       "        ... \n",
       "1138       1\n",
       "1678       1\n",
       "1544       1\n",
       "1318       1\n",
       "2266       1\n",
       "Name: duration, Length: 1771, dtype: int64"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "click_df['duration'].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3be4a455",
   "metadata": {},
   "source": [
    "感觉得把这种观看时间太短的去掉"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "3e4cdaeb",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(586056, 9)"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "click_df.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "13e01f44",
   "metadata": {},
   "source": [
    "## 看缺失情况"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "23eb44fb",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "user_id         0\n",
       "device      67744\n",
       "os          67808\n",
       "province    92260\n",
       "city        96072\n",
       "age         59891\n",
       "gender      58572\n",
       "dtype: int64"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "user_info.isnull().sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "ea723818",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "article_id        0\n",
       "title             0\n",
       "ctime           212\n",
       "img_num         212\n",
       "cat_1           255\n",
       "cat_2           256\n",
       "key_words     10137\n",
       "dtype: int64"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "doc_info.isnull().sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "id": "d01178b7",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(230841, 7)"
      ]
     },
     "execution_count": 63,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "doc_info.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "id": "1137bc6d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "user_id          0\n",
       "article_id       0\n",
       "expo_time        0\n",
       "net_status       0\n",
       "flush_nums       0\n",
       "expo_position    0\n",
       "click            0\n",
       "duration         0\n",
       "dtype: int64"
      ]
     },
     "execution_count": 64,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train_data.isnull().sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c2b5dddf",
   "metadata": {},
   "source": [
    "用户画像数据缺失比较多，需要处理， 而doc_info数据可以采用简单填充方式， 关键词目前用不到"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "42bc0f85",
   "metadata": {},
   "source": [
    "## 处理用户数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "81f3efdf",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>user_id</th>\n",
       "      <th>device</th>\n",
       "      <th>os</th>\n",
       "      <th>province</th>\n",
       "      <th>city</th>\n",
       "      <th>age</th>\n",
       "      <th>gender</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1000372820</td>\n",
       "      <td>TAS-AN00</td>\n",
       "      <td>Android</td>\n",
       "      <td>广东</td>\n",
       "      <td>广州</td>\n",
       "      <td>A_0_24:0.404616,A_25_29:0.059027,A_30_39:0.516...</td>\n",
       "      <td>female:0.051339,male:0.948661</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1000652892</td>\n",
       "      <td>PACM00</td>\n",
       "      <td>Android</td>\n",
       "      <td>河北</td>\n",
       "      <td>唐山</td>\n",
       "      <td>A_0_24:0.615458,A_25_29:0.086233,A_30_39:0.141...</td>\n",
       "      <td>female:0.280295,male:0.719705</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1000908852</td>\n",
       "      <td>MI6X</td>\n",
       "      <td>Android</td>\n",
       "      <td>上海</td>\n",
       "      <td>上海</td>\n",
       "      <td>A_0_24:0.123255,A_25_29:0.208225,A_30_39:0.298...</td>\n",
       "      <td>female:0.000000,male:1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1001168798</td>\n",
       "      <td>iPhone11</td>\n",
       "      <td>IOS</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>A_0_24:0.436296,A_25_29:0.489370,A_30_39:0.061...</td>\n",
       "      <td>female:0.870710,male:0.129290</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1001305614</td>\n",
       "      <td>M2103K19C</td>\n",
       "      <td>Android</td>\n",
       "      <td>江苏</td>\n",
       "      <td>苏州</td>\n",
       "      <td>A_0_24:0.006632,A_25_29:0.043408,A_30_39:0.350...</td>\n",
       "      <td>female:0.000000,male:1.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      user_id     device       os province city  \\\n",
       "0  1000372820   TAS-AN00  Android       广东   广州   \n",
       "1  1000652892     PACM00  Android       河北   唐山   \n",
       "2  1000908852       MI6X  Android       上海   上海   \n",
       "3  1001168798   iPhone11      IOS      NaN  NaN   \n",
       "4  1001305614  M2103K19C  Android       江苏   苏州   \n",
       "\n",
       "                                                 age  \\\n",
       "0  A_0_24:0.404616,A_25_29:0.059027,A_30_39:0.516...   \n",
       "1  A_0_24:0.615458,A_25_29:0.086233,A_30_39:0.141...   \n",
       "2  A_0_24:0.123255,A_25_29:0.208225,A_30_39:0.298...   \n",
       "3  A_0_24:0.436296,A_25_29:0.489370,A_30_39:0.061...   \n",
       "4  A_0_24:0.006632,A_25_29:0.043408,A_30_39:0.350...   \n",
       "\n",
       "                          gender  \n",
       "0  female:0.051339,male:0.948661  \n",
       "1  female:0.280295,male:0.719705  \n",
       "2  female:0.000000,male:1.000000  \n",
       "3  female:0.870710,male:0.129290  \n",
       "4  female:0.000000,male:1.000000  "
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "user_info.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "90104430",
   "metadata": {},
   "source": [
    "age和gender必须得选择出一种来， 根据概率取值选择"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "id": "ff475f23",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "user_id      int64\n",
       "device      object\n",
       "os          object\n",
       "province    object\n",
       "city        object\n",
       "age         object\n",
       "gender      object\n",
       "dtype: object"
      ]
     },
     "execution_count": 74,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "user_info.dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "194a9219",
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_age_gender(x):\n",
    "    # 空值注意下\n",
    "    if pd.isna(x):\n",
    "        return x\n",
    "    x_list = x.split(',')\n",
    "    age_stage_val = list(map(lambda x: x.split(':'), x_list))\n",
    "    age_stage = list(map(lambda x: x[0], age_stage_val))\n",
    "    age_val = list(map(lambda x: x[1], age_stage_val))\n",
    "    return age_stage[np.argmax(age_val)]\n",
    "\n",
    "user_info['age'] = user_info['age'].apply(lambda x: get_age_gender(x))\n",
    "user_info['gender'] = user_info['gender'].apply(lambda x: get_age_gender(x))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "ba8f9174",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>user_id</th>\n",
       "      <th>device</th>\n",
       "      <th>os</th>\n",
       "      <th>province</th>\n",
       "      <th>city</th>\n",
       "      <th>age</th>\n",
       "      <th>gender</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1000372820</td>\n",
       "      <td>TAS-AN00</td>\n",
       "      <td>Android</td>\n",
       "      <td>广东</td>\n",
       "      <td>广州</td>\n",
       "      <td>A_30_39</td>\n",
       "      <td>male</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1000652892</td>\n",
       "      <td>PACM00</td>\n",
       "      <td>Android</td>\n",
       "      <td>河北</td>\n",
       "      <td>唐山</td>\n",
       "      <td>A_0_24</td>\n",
       "      <td>male</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1000908852</td>\n",
       "      <td>MI6X</td>\n",
       "      <td>Android</td>\n",
       "      <td>上海</td>\n",
       "      <td>上海</td>\n",
       "      <td>A_40+</td>\n",
       "      <td>male</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1001168798</td>\n",
       "      <td>iPhone11</td>\n",
       "      <td>IOS</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>A_25_29</td>\n",
       "      <td>female</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1001305614</td>\n",
       "      <td>M2103K19C</td>\n",
       "      <td>Android</td>\n",
       "      <td>江苏</td>\n",
       "      <td>苏州</td>\n",
       "      <td>A_40+</td>\n",
       "      <td>male</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      user_id     device       os province city      age  gender\n",
       "0  1000372820   TAS-AN00  Android       广东   广州  A_30_39    male\n",
       "1  1000652892     PACM00  Android       河北   唐山   A_0_24    male\n",
       "2  1000908852       MI6X  Android       上海   上海    A_40+    male\n",
       "3  1001168798   iPhone11      IOS      NaN  NaN  A_25_29  female\n",
       "4  1001305614  M2103K19C  Android       江苏   苏州    A_40+    male"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "user_info.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "id": "9a118a1c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "user_id     0\n",
       "device      0\n",
       "os          0\n",
       "province    0\n",
       "city        0\n",
       "age         0\n",
       "gender      0\n",
       "dtype: int64"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "user_info.isnull().sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "id": "d15956f6",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 填充空值\n",
    "user_info.fillna('nan', inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "713174b5",
   "metadata": {},
   "source": [
    "user数据的编码转换，在这里先不做了， 拼接到日志数据上去，在YouTubeDNN那里统一做"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "0b35b6d8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# from sklearn.preprocessing import LabelEncoder\n",
    "# 编码转换\n",
    "# cols = user_info.columns[1:]  # Index(['device', 'os', 'province', 'city', 'age', 'gender'], dtype='object')\n",
    "\n",
    "# for col in cols:\n",
    "#     enc = LabelEncoder()\n",
    "#     user_info[col] = enc.fit_transform(user_info[col])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "id": "bf970484",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>user_id</th>\n",
       "      <th>device</th>\n",
       "      <th>os</th>\n",
       "      <th>province</th>\n",
       "      <th>city</th>\n",
       "      <th>age</th>\n",
       "      <th>gender</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1000372820</td>\n",
       "      <td>TAS-AN00</td>\n",
       "      <td>Android</td>\n",
       "      <td>广东</td>\n",
       "      <td>广州</td>\n",
       "      <td>A_30_39</td>\n",
       "      <td>male</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1000652892</td>\n",
       "      <td>PACM00</td>\n",
       "      <td>Android</td>\n",
       "      <td>河北</td>\n",
       "      <td>唐山</td>\n",
       "      <td>A_0_24</td>\n",
       "      <td>male</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1000908852</td>\n",
       "      <td>MI6X</td>\n",
       "      <td>Android</td>\n",
       "      <td>上海</td>\n",
       "      <td>上海</td>\n",
       "      <td>A_40+</td>\n",
       "      <td>male</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1001168798</td>\n",
       "      <td>iPhone11</td>\n",
       "      <td>IOS</td>\n",
       "      <td>nan</td>\n",
       "      <td>nan</td>\n",
       "      <td>A_25_29</td>\n",
       "      <td>female</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1001305614</td>\n",
       "      <td>M2103K19C</td>\n",
       "      <td>Android</td>\n",
       "      <td>江苏</td>\n",
       "      <td>苏州</td>\n",
       "      <td>A_40+</td>\n",
       "      <td>male</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      user_id     device       os province city      age  gender\n",
       "0  1000372820   TAS-AN00  Android       广东   广州  A_30_39    male\n",
       "1  1000652892     PACM00  Android       河北   唐山   A_0_24    male\n",
       "2  1000908852       MI6X  Android       上海   上海    A_40+    male\n",
       "3  1001168798   iPhone11      IOS      nan  nan  A_25_29  female\n",
       "4  1001305614  M2103K19C  Android       江苏   苏州    A_40+    male"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "user_info.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "id": "6dee90a0",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 保存一份新数据\n",
    "user_info.to_csv('data_process/user_info.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dc4446c0",
   "metadata": {},
   "source": [
    "## 处理文章数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "id": "2dd9a2d8",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>article_id</th>\n",
       "      <th>title</th>\n",
       "      <th>ctime</th>\n",
       "      <th>img_num</th>\n",
       "      <th>cat_1</th>\n",
       "      <th>cat_2</th>\n",
       "      <th>key_words</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>349635709</td>\n",
       "      <td>拿到c1驾照后,实习期扣分了会怎样?扣12分驾照会吊销么?</td>\n",
       "      <td>1572519971000</td>\n",
       "      <td>9</td>\n",
       "      <td>汽车</td>\n",
       "      <td>汽车/用车</td>\n",
       "      <td>上班族:8.469502,买车:8.137443,二手车:9.022247,副页:11.21...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>361653323</td>\n",
       "      <td>疫情谣言粉碎机丨接种新冠疫苗后用麻药或致死?盘点最新疫情谣言,别被忽悠了</td>\n",
       "      <td>1624522285000</td>\n",
       "      <td>1</td>\n",
       "      <td>健康</td>\n",
       "      <td>健康/疾病防护治疗及西医用药</td>\n",
       "      <td>医生:14.760494,吸烟:16.474872,板蓝根:15.597788,板蓝根^^熏...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>426732705</td>\n",
       "      <td>实拍本田飞度:空间真大,8万出头工薪族可选,但内饰能忍?</td>\n",
       "      <td>1610808303000</td>\n",
       "      <td>9</td>\n",
       "      <td>汽车</td>\n",
       "      <td>汽车/买车</td>\n",
       "      <td>155n:8.979802,polo:7.951116,中控台:5.954278,中网:7....</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>430221183</td>\n",
       "      <td>搭载135kw电机比亚迪秦plus纯电动版外观更精致</td>\n",
       "      <td>1612581556000</td>\n",
       "      <td>2</td>\n",
       "      <td>汽车</td>\n",
       "      <td>汽车/买车</td>\n",
       "      <td>etc:12.055207,代表:8.878175,内饰:5.342025,刀片:9.453...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>441756326</td>\n",
       "      <td>【提车作业】不顾他人眼光帕萨特phev俘获30老男人浪子心</td>\n",
       "      <td>1618825835000</td>\n",
       "      <td>23</td>\n",
       "      <td>汽车</td>\n",
       "      <td>汽车/买车</td>\n",
       "      <td>丰田凯美瑞:12.772149,充电器:8.394001,品牌:8.436843,城市:7....</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   article_id                                 title          ctime img_num  \\\n",
       "0   349635709         拿到c1驾照后,实习期扣分了会怎样?扣12分驾照会吊销么?  1572519971000       9   \n",
       "1   361653323  疫情谣言粉碎机丨接种新冠疫苗后用麻药或致死?盘点最新疫情谣言,别被忽悠了  1624522285000       1   \n",
       "2   426732705          实拍本田飞度:空间真大,8万出头工薪族可选,但内饰能忍?  1610808303000       9   \n",
       "3   430221183            搭载135kw电机比亚迪秦plus纯电动版外观更精致  1612581556000       2   \n",
       "4   441756326         【提车作业】不顾他人眼光帕萨特phev俘获30老男人浪子心  1618825835000      23   \n",
       "\n",
       "  cat_1           cat_2                                          key_words  \n",
       "0    汽车           汽车/用车  上班族:8.469502,买车:8.137443,二手车:9.022247,副页:11.21...  \n",
       "1    健康  健康/疾病防护治疗及西医用药  医生:14.760494,吸烟:16.474872,板蓝根:15.597788,板蓝根^^熏...  \n",
       "2    汽车           汽车/买车  155n:8.979802,polo:7.951116,中控台:5.954278,中网:7....  \n",
       "3    汽车           汽车/买车  etc:12.055207,代表:8.878175,内饰:5.342025,刀片:9.453...  \n",
       "4    汽车           汽车/买车  丰田凯美瑞:12.772149,充电器:8.394001,品牌:8.436843,城市:7....  "
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "doc_info.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "id": "65005c26",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    1625400960000.0\n",
       "dtype: object"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "doc_info['ctime'].mode()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "id": "ca98a5d0",
   "metadata": {},
   "outputs": [],
   "source": [
    "doc_info['ctime'] = doc_info['ctime'].str.replace('Android', '1625400960000')\n",
    "doc_info['ctime'].fillna('1625400960000', inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2c4cd888",
   "metadata": {},
   "source": [
    "时间列处理"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "id": "84e5a9bf",
   "metadata": {},
   "outputs": [],
   "source": [
    "doc_info['ctime'] = doc_info['ctime'].apply(lambda x: datetime.fromtimestamp(int(x)/1000) \\\n",
    "                                                        .strftime('%Y-%m-%d %H:%M:%S'))\n",
    "doc_info['ctime'] = pd.to_datetime(doc_info['ctime'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "id": "0fc8090f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>article_id</th>\n",
       "      <th>title</th>\n",
       "      <th>ctime</th>\n",
       "      <th>img_num</th>\n",
       "      <th>cat_1</th>\n",
       "      <th>cat_2</th>\n",
       "      <th>key_words</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>349635709</td>\n",
       "      <td>拿到c1驾照后,实习期扣分了会怎样?扣12分驾照会吊销么?</td>\n",
       "      <td>2019-10-31 19:06:11</td>\n",
       "      <td>9</td>\n",
       "      <td>汽车</td>\n",
       "      <td>汽车/用车</td>\n",
       "      <td>上班族:8.469502,买车:8.137443,二手车:9.022247,副页:11.21...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>361653323</td>\n",
       "      <td>疫情谣言粉碎机丨接种新冠疫苗后用麻药或致死?盘点最新疫情谣言,别被忽悠了</td>\n",
       "      <td>2021-06-24 16:11:25</td>\n",
       "      <td>1</td>\n",
       "      <td>健康</td>\n",
       "      <td>健康/疾病防护治疗及西医用药</td>\n",
       "      <td>医生:14.760494,吸烟:16.474872,板蓝根:15.597788,板蓝根^^熏...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>426732705</td>\n",
       "      <td>实拍本田飞度:空间真大,8万出头工薪族可选,但内饰能忍?</td>\n",
       "      <td>2021-01-16 22:45:03</td>\n",
       "      <td>9</td>\n",
       "      <td>汽车</td>\n",
       "      <td>汽车/买车</td>\n",
       "      <td>155n:8.979802,polo:7.951116,中控台:5.954278,中网:7....</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>430221183</td>\n",
       "      <td>搭载135kw电机比亚迪秦plus纯电动版外观更精致</td>\n",
       "      <td>2021-02-06 11:19:16</td>\n",
       "      <td>2</td>\n",
       "      <td>汽车</td>\n",
       "      <td>汽车/买车</td>\n",
       "      <td>etc:12.055207,代表:8.878175,内饰:5.342025,刀片:9.453...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>441756326</td>\n",
       "      <td>【提车作业】不顾他人眼光帕萨特phev俘获30老男人浪子心</td>\n",
       "      <td>2021-04-19 17:50:35</td>\n",
       "      <td>23</td>\n",
       "      <td>汽车</td>\n",
       "      <td>汽车/买车</td>\n",
       "      <td>丰田凯美瑞:12.772149,充电器:8.394001,品牌:8.436843,城市:7....</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   article_id                                 title               ctime  \\\n",
       "0   349635709         拿到c1驾照后,实习期扣分了会怎样?扣12分驾照会吊销么? 2019-10-31 19:06:11   \n",
       "1   361653323  疫情谣言粉碎机丨接种新冠疫苗后用麻药或致死?盘点最新疫情谣言,别被忽悠了 2021-06-24 16:11:25   \n",
       "2   426732705          实拍本田飞度:空间真大,8万出头工薪族可选,但内饰能忍? 2021-01-16 22:45:03   \n",
       "3   430221183            搭载135kw电机比亚迪秦plus纯电动版外观更精致 2021-02-06 11:19:16   \n",
       "4   441756326         【提车作业】不顾他人眼光帕萨特phev俘获30老男人浪子心 2021-04-19 17:50:35   \n",
       "\n",
       "  img_num cat_1           cat_2  \\\n",
       "0       9    汽车           汽车/用车   \n",
       "1       1    健康  健康/疾病防护治疗及西医用药   \n",
       "2       9    汽车           汽车/买车   \n",
       "3       2    汽车           汽车/买车   \n",
       "4      23    汽车           汽车/买车   \n",
       "\n",
       "                                           key_words  \n",
       "0  上班族:8.469502,买车:8.137443,二手车:9.022247,副页:11.21...  \n",
       "1  医生:14.760494,吸烟:16.474872,板蓝根:15.597788,板蓝根^^熏...  \n",
       "2  155n:8.979802,polo:7.951116,中控台:5.954278,中网:7....  \n",
       "3  etc:12.055207,代表:8.878175,内饰:5.342025,刀片:9.453...  \n",
       "4  丰田凯美瑞:12.772149,充电器:8.394001,品牌:8.436843,城市:7....  "
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "doc_info.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "id": "9b342c88",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "article_id        0\n",
       "title             0\n",
       "ctime             0\n",
       "img_num           0\n",
       "cat_1             0\n",
       "cat_2             0\n",
       "key_words     10137\n",
       "dtype: int64"
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "doc_info.isnull().sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "id": "01b01679",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 填充其他列的缺失\n",
    "doc_info['img_num'].fillna(0.0, inplace=True)\n",
    "doc_info['cat_1'].fillna(doc_info['cat_1'].mode()[0], inplace=True)\n",
    "doc_info['cat_2'].fillna(doc_info['cat_2'].mode()[0], inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 139,
   "id": "24b7bff8",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(592749, 7)"
      ]
     },
     "execution_count": 139,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "doc_info.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4fd13893",
   "metadata": {},
   "source": [
    "## 处理日志数据\n",
    "日志数据先拼接上用户画像，保存一份"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "id": "069534be",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>user_id</th>\n",
       "      <th>article_id</th>\n",
       "      <th>expo_time</th>\n",
       "      <th>net_status</th>\n",
       "      <th>flush_nums</th>\n",
       "      <th>exop_position</th>\n",
       "      <th>click</th>\n",
       "      <th>duration</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1000541010</td>\n",
       "      <td>464467760</td>\n",
       "      <td>2021-06-30 09:57:14</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>13</td>\n",
       "      <td>1</td>\n",
       "      <td>28</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1000541010</td>\n",
       "      <td>463850913</td>\n",
       "      <td>2021-06-30 09:57:14</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>15</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1000541010</td>\n",
       "      <td>464022440</td>\n",
       "      <td>2021-06-30 09:57:14</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>17</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1000541010</td>\n",
       "      <td>464586545</td>\n",
       "      <td>2021-06-30 09:58:31</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>20</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1000541010</td>\n",
       "      <td>465352885</td>\n",
       "      <td>2021-07-03 18:13:03</td>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>18</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      user_id  article_id           expo_time  net_status  flush_nums  \\\n",
       "0  1000541010   464467760 2021-06-30 09:57:14           2           0   \n",
       "1  1000541010   463850913 2021-06-30 09:57:14           2           0   \n",
       "2  1000541010   464022440 2021-06-30 09:57:14           2           0   \n",
       "3  1000541010   464586545 2021-06-30 09:58:31           2           1   \n",
       "4  1000541010   465352885 2021-07-03 18:13:03           5           0   \n",
       "\n",
       "   exop_position  click  duration  \n",
       "0             13      1        28  \n",
       "1             15      0         0  \n",
       "2             17      0         0  \n",
       "3             20      0         0  \n",
       "4             18      0         0  "
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train_data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "id": "cce4f479",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Left-join the user profile onto each exposure log row (keeps every log row)\n",
    "train_data = train_data.merge(user_info, on='user_id', how='left')\n",
    "# Left-join the article profile columns needed downstream\n",
    "train_data_new = train_data.merge(doc_info[['article_id', 'ctime', 'img_num', 'cat_1', 'cat_2']], on='article_id', how='left')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "id": "a58e4b40",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>user_id</th>\n",
       "      <th>article_id</th>\n",
       "      <th>expo_time</th>\n",
       "      <th>net_status</th>\n",
       "      <th>flush_nums</th>\n",
       "      <th>exop_position</th>\n",
       "      <th>click</th>\n",
       "      <th>duration</th>\n",
       "      <th>device</th>\n",
       "      <th>os</th>\n",
       "      <th>province</th>\n",
       "      <th>city</th>\n",
       "      <th>age</th>\n",
       "      <th>gender</th>\n",
       "      <th>ctime</th>\n",
       "      <th>img_num</th>\n",
       "      <th>cat_1</th>\n",
       "      <th>cat_2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1000541010</td>\n",
       "      <td>464467760</td>\n",
       "      <td>2021-06-30 09:57:14</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>13</td>\n",
       "      <td>1</td>\n",
       "      <td>28</td>\n",
       "      <td>V2054A</td>\n",
       "      <td>Android</td>\n",
       "      <td>上海</td>\n",
       "      <td>上海</td>\n",
       "      <td>A_0_24</td>\n",
       "      <td>female</td>\n",
       "      <td>2021-06-29 14:46:43</td>\n",
       "      <td>3</td>\n",
       "      <td>娱乐</td>\n",
       "      <td>娱乐/港台明星</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1000541010</td>\n",
       "      <td>463850913</td>\n",
       "      <td>2021-06-30 09:57:14</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>15</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>V2054A</td>\n",
       "      <td>Android</td>\n",
       "      <td>上海</td>\n",
       "      <td>上海</td>\n",
       "      <td>A_0_24</td>\n",
       "      <td>female</td>\n",
       "      <td>2021-06-27 22:29:13</td>\n",
       "      <td>11</td>\n",
       "      <td>时尚</td>\n",
       "      <td>时尚/女性时尚</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1000541010</td>\n",
       "      <td>464022440</td>\n",
       "      <td>2021-06-30 09:57:14</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>17</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>V2054A</td>\n",
       "      <td>Android</td>\n",
       "      <td>上海</td>\n",
       "      <td>上海</td>\n",
       "      <td>A_0_24</td>\n",
       "      <td>female</td>\n",
       "      <td>2021-06-28 12:22:54</td>\n",
       "      <td>7</td>\n",
       "      <td>农村</td>\n",
       "      <td>农村/农业资讯</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1000541010</td>\n",
       "      <td>464586545</td>\n",
       "      <td>2021-06-30 09:58:31</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>20</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>V2054A</td>\n",
       "      <td>Android</td>\n",
       "      <td>上海</td>\n",
       "      <td>上海</td>\n",
       "      <td>A_0_24</td>\n",
       "      <td>female</td>\n",
       "      <td>2021-06-29 13:25:06</td>\n",
       "      <td>5</td>\n",
       "      <td>娱乐</td>\n",
       "      <td>娱乐/港台明星</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1000541010</td>\n",
       "      <td>465352885</td>\n",
       "      <td>2021-07-03 18:13:03</td>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>18</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>V2054A</td>\n",
       "      <td>Android</td>\n",
       "      <td>上海</td>\n",
       "      <td>上海</td>\n",
       "      <td>A_0_24</td>\n",
       "      <td>female</td>\n",
       "      <td>2021-07-02 10:43:51</td>\n",
       "      <td>18</td>\n",
       "      <td>娱乐</td>\n",
       "      <td>娱乐/港台明星</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      user_id  article_id           expo_time  net_status  flush_nums  \\\n",
       "0  1000541010   464467760 2021-06-30 09:57:14           2           0   \n",
       "1  1000541010   463850913 2021-06-30 09:57:14           2           0   \n",
       "2  1000541010   464022440 2021-06-30 09:57:14           2           0   \n",
       "3  1000541010   464586545 2021-06-30 09:58:31           2           1   \n",
       "4  1000541010   465352885 2021-07-03 18:13:03           5           0   \n",
       "\n",
       "   exop_position  click  duration  device       os province city     age  \\\n",
       "0             13      1        28  V2054A  Android       上海   上海  A_0_24   \n",
       "1             15      0         0  V2054A  Android       上海   上海  A_0_24   \n",
       "2             17      0         0  V2054A  Android       上海   上海  A_0_24   \n",
       "3             20      0         0  V2054A  Android       上海   上海  A_0_24   \n",
       "4             18      0         0  V2054A  Android       上海   上海  A_0_24   \n",
       "\n",
       "   gender               ctime img_num cat_1    cat_2  \n",
       "0  female 2021-06-29 14:46:43       3    娱乐  娱乐/港台明星  \n",
       "1  female 2021-06-27 22:29:13      11    时尚  时尚/女性时尚  \n",
       "2  female 2021-06-28 12:22:54       7    农村  农村/农业资讯  \n",
       "3  female 2021-06-29 13:25:06       5    娱乐  娱乐/港台明星  \n",
       "4  female 2021-07-02 10:43:51      18    娱乐  娱乐/港台明星  "
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train_data_new.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "id": "8b209079",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "user_id          0\n",
       "article_id       0\n",
       "expo_time        0\n",
       "net_status       0\n",
       "flush_nums       0\n",
       "exop_position    0\n",
       "click            0\n",
       "duration         0\n",
       "device           0\n",
       "os               0\n",
       "province         0\n",
       "city             0\n",
       "age              0\n",
       "gender           0\n",
       "ctime            0\n",
       "img_num          0\n",
       "cat_1            0\n",
       "cat_2            0\n",
       "dtype: int64"
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train_data_new.isnull().sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "id": "8dac7e38",
   "metadata": {},
   "outputs": [],
   "source": [
    "train_data_new.to_csv('data_process/train_data.csv')"
   ]
  },
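  {
   "cell_type": "markdown",
   "id": "99f45da8",
   "metadata": {},
   "source": [
    "Minor issues found in the newly saved dataset:\n",
    "1. Many of the clicked articles have no article profile.\n",
    "2. Many articles have an exposure time earlier than their creation time, which is erroneous data. Given the amount of data involved, these rows are left untouched for now.\n",
    "\n",
    "These issues were resolved by resampling and reprocessing the data."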
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
